{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Finetune_and_generate_RuGPTs_only_with_megatron.ipynb",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "pRzAXrPVzsHX"
      },
      "source": [
        "# Finetune RuGPTs in megatron without deepspeed\n",
        "How to finetune RuGPTs models with megatron without deepspeed. Example for RuGPT3Small. Note for other models it will take more GPU memory.\n",
        "\n",
        "This notebook is valid for all RuGPTs models except RuGPT3XL.\n",
        "## Install env"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "!rm -rf /usr/local/cuda\n",
        "!ln -s /usr/local/cuda-10.1 /usr/local/cuda"
      ],
      "metadata": {
        "id": "22qKsT3KpKFv"
      },
      "execution_count": 2,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "%%bash\n",
        "export LD_LIBRARY_PATH=/usr/lib/"
      ],
      "metadata": {
        "id": "VAHq-7ZdpW5N"
      },
      "execution_count": 4,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "!apt-get install clang-9 llvm-9 llvm-9-dev llvm-9-tools"
      ],
      "metadata": {
        "id": "ORYJEnvKpXId"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Vn0fHQmpTPcF",
        "outputId": "846b0f34-3ac6-46f4-d357-35a15e78fc28",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "source": [
        "import subprocess\n",
        "\n",
        "CUDA_version = [s for s in subprocess.check_output([\"nvcc\", \"--version\"]).decode(\"UTF-8\").split(\", \") if s.startswith(\"release\")][0].split(\" \")[-1]\n",
        "print(\"CUDA version:\", CUDA_version)\n",
        "\n",
        "if CUDA_version == \"10.0\":\n",
        "    torch_version_suffix = \"+cu100\"\n",
        "elif CUDA_version == \"10.1\":\n",
        "    torch_version_suffix = \"+cu101\"\n",
        "elif CUDA_version == \"10.2\":\n",
        "    torch_version_suffix = \"\"\n",
        "else:\n",
        "    torch_version_suffix = \"+cu110\""
      ],
      "execution_count": 6,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "CUDA version: 10.1\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Maf99CebV3oT"
      },
      "source": [
        "If code below doesn't work, check your cuda version and installation here https://pytorch.org/get-started/previous-versions/"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "8uNRRWUaVQN0"
      },
      "source": [
        "!pip install torch==1.7.1{torch_version_suffix} torchvision==0.8.2{torch_version_suffix} torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ozJOYbK-11pk",
        "outputId": "fc392543-1ef0-476e-dd06-e26be5b0cd71"
      },
      "source": [
        "%%writefile setup.sh\n",
        "\n",
        "git clone https://github.com/NVIDIA/apex\n",
        "cd apex\n",
        "pip install -v --disable-pip-version-check --no-cache-dir --global-option=\"--cpp_ext\" --global-option=\"--cuda_ext\" ./"
      ],
      "execution_count": 8,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Writing setup.sh\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "M46Pk6DJ19Jk"
      },
      "source": [
        "!sh setup.sh"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "hu1OzWZ6zqQv"
      },
      "source": [
        "!pip3 install transformers==3.5.0"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "_gE7xBM_z-uW"
      },
      "source": [
        "!git clone  https://github.com/sberbank-ai/ru-gpts"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8di4sCoS0Pyw"
      },
      "source": [
        "## Download files"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "96qG_A1n0CiF"
      },
      "source": [
        "!wget -O train.txt https://www.dropbox.com/s/oa3v9c7g9bp40xw/train.txt?dl=0\n",
        "!wget -O valid.txt https://www.dropbox.com/s/mworl3ld6r3bg62/valid.txt?dl=0"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "jqW38Hni64xH"
      },
      "source": [
        "## Prepare data for parallel\n",
        "We use custom implementation of distributed dataset. For training and evaluating we should specify file `file.list` with list of paths to txt files. All files from `file.list` will be splitted between aviable GPUs. The logic of splitting is described by the following code:\n",
        "\n",
        "```python\n",
        "shard_size = len(files) // world_size\n",
        "shard_start = rank * shard_size\n",
        "shard_end = (rank + 1) * shard_size\n",
        "files = files[shard_start:shard_end]\n",
        "```\n",
        "\n",
        "For more details please see full code of dataset: `src.dataset_rugpt3.RuGpt3TextDataset`."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "wtItuLGA38db"
      },
      "source": [
        "!echo train.txt > train.list\n",
        "!echo valid.txt > valid.list"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-EF0JepF0S41"
      },
      "source": [
        "## Train\n",
        "Load model from Huggingface and finetune on essays.\n",
        "\n",
        "This will take arount ten minutes."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "XHluAlFh0SJo"
      },
      "source": [
        "!export PYTHONPATH=${PYTHONPATH}:${HOME}/ru-gpts\n",
        "\n",
        "!python ru-gpts/pretrain_gpt3.py \\\n",
        "  --train-data-path \"train.list\" \\\n",
        "  --test-data-path \"valid.list\" \\\n",
        "  --max-files-per-process 100 \\\n",
        "  --logging-dir=\"log\" \\\n",
        "  --save model \\\n",
        "  --load-huggingface sberbank-ai/rugpt3small_based_on_gpt2 \\\n",
        "  --save-interval 1000 \\\n",
        "  --log-interval 100 \\\n",
        "  --eval-interval 1000 \\\n",
        "  --eval-iters 100 \\\n",
        "  --model-parallel-size 1 \\\n",
        "  --num-layers 12 \\\n",
        "  --hidden-size 768 \\\n",
        "  --num-attention-heads 12 \\\n",
        "  --batch-size 1 \\\n",
        "  --seq-length 512 \\\n",
        "  --max-position-embeddings 2048 \\\n",
        "  --train-iters 2000 \\\n",
        "  --resume-dataloader \\\n",
        "  --distributed-backend \"nccl\" \\\n",
        "  --lr 0.00015 \\\n",
        "  --lr-decay-style \"cosine\" \\\n",
        "  --lr-decay-iters 3200 \\\n",
        "  --clip-grad 0.5 \\\n",
        "  --warmup .004\n"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ALvcD5SE8RtP"
      },
      "source": [
        "At the end of training output should be something like this:\n",
        "\n",
        "\"-----------------------------------------------------------------------------------------\n",
        "\n",
        " validation loss at the end of training for test data | LM loss: 2.7927 | LM PPL: 16.325\n",
        " \n",
        "-----------------------------------------------------------------------------------------\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0HmKilrb8lQm"
      },
      "source": [
        "## Generate\n",
        "\n",
        "Load pretrained model from dir and generate."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "kAH-WpCG8lmG",
        "outputId": "1a482aa4-9901-4193-bcc0-3c983d12af66"
      },
      "source": [
        "!export PYTHONPATH=${PYTHONPATH}:${HOME}/ru-gpts\n",
        "\n",
        "!python ru-gpts/generate_samples.py \\\n",
        "  --load model/ \\\n",
        "  --model-parallel-size 1 \\\n",
        "  --num-layers 12 \\\n",
        "  --hidden-size 768 \\\n",
        "  --num-attention-heads 12 \\\n",
        "  --batch-size 1 \\\n",
        "  --seq-length 512 \\\n",
        "  --max-position-embeddings 2048 \\\n",
        "  --distributed-backend \"nccl\" \\\n",
        "  --tokenizer-path sberbank-ai/rugpt3small_based_on_gpt2 \\\n",
        "  --no-load-optim\n"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "2021-08-13 15:00:38.346998: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0\n",
            "Generate Samples\n",
            "WARNING: No training data specified\n",
            "using world size: 1 and model-parallel size: 1 \n",
            " > using dynamic loss scaling\n",
            "> initializing model parallel with size 1\n",
            "> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234\n",
            "prepare tokenizer done, size 50264\n",
            "building GPT3 model ...\n",
            " > number of parameters on model parallel rank 0: 125231616\n",
            "Load checkpoint from model/\n",
            "global rank 0 is loading checkpoint model/iter_0002000/mp_rank_00/model_optim_rng.pt\n",
            "  successfully loaded model/iter_0002000/mp_rank_00/model_optim_rng.pt\n",
            "Loaded\n",
            "\n",
            "Context prompt (stop to exit) >>> <s>Тема: «Создает человека природа, но развивает и образует его общество». (В.Т. Белинский)\\nСочинение:\n",
            "\u001b[H\u001b[2J\n",
            "Taken time 14.35\n",
            "\n",
            "\n",
            "Context: <s>Тема: «Создает человека природа, но развивает и образует его общество». (В.Т. Белинский)\\nСочинение:\n",
            "\n",
            "GPT: <s>Тема: «Создает человека природа, но развивает и образует его общество». (В.Т. Белинский)\\nСочинение: Главной гуманистической идеей в романе «Евгений Онегин» является вопрос о свободе воли. В «Гроза» лирический герой утверждает свою свободу, но лирическая героиня под линно выражает позицию, высказанную Цицероном: «Я считаю закон лишь законом, и этого не хочу делать». Автора интересует вопрос о свободе воли, но лирическая героиня выражает свою позицию о ней самой. Она говорит о своей свободе, но лириче ской позиции не предполагает ни свободы воли, ни свободы совести. С позиции Базарова невозможно ответить на вопрос, свободен ли он. Для Автора законом является «Свободу, но и свободу вы бираю я». Он освобождается от сво ей идеи, но лириче ской позиции он не может изменить ее. Этот вывод подтверждает, что если человек нарушает чьи-то права, то это не освобождает его от ответственности. Бакаров показывает истинное лицо Автора и показывает ему пример свободы. Автор показывает свое отношение к герою. Он показывает, что открытое письмо Беликова к знакомому нет лириче ской замышляемым путем, к герою, отрицающее его любовь, и, наконец, к автору. Зная ответ на письмо, автор пытается решить, можно ли назвать Катерину «странной». «Странные» чувства он вызывает в ней: «Угроза, предзнаменование», «бесконечное отвращение». Автор мо ж<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad\n",
            "\n",
            "Context prompt (stop to exit) >>> Traceback (most recent call last):\n",
            "  File \"ru-gpts/generate_samples.py\", line 106, in generate_samples\n",
            "    raw_text = input(\"\\nContext prompt (stop to exit) >>> \")\n",
            "KeyboardInterrupt\n",
            "\n",
            "During handling of the above exception, another exception occurred:\n",
            "\n",
            "Traceback (most recent call last):\n",
            "  File \"ru-gpts/generate_samples.py\", line 204, in <module>\n",
            "    main()\n",
            "  File \"ru-gpts/generate_samples.py\", line 200, in main\n",
            "    generate_samples(model, tokenizer, args)\n",
            "  File \"ru-gpts/generate_samples.py\", line 106, in generate_samples\n",
            "    raw_text = input(\"\\nContext prompt (stop to exit) >>> \")\n",
            "KeyboardInterrupt\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "VCapfDfeBq0x"
      },
      "source": [
        "### Convert checkpoint to Huggingface format"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "4JnhIyqd-Eeo",
        "outputId": "98b0c48c-9127-4cd9-d86a-09951f11851c"
      },
      "source": [
        "!export PYTHONPATH=${PYTHONPATH}:${HOME}/ru-gpts\n",
        "\n",
        "!python ru-gpts/convert2huggingface.py \\\n",
        "  --load model/ \\\n",
        "  --model-parallel-size 1 \\\n",
        "  --num-layers 12 \\\n",
        "  --hidden-size 768 \\\n",
        "  --num-attention-heads 12 \\\n",
        "  --max-position-embeddings 2048 \\\n",
        "  --tokenizer-path sberbank-ai/rugpt3small_based_on_gpt2 \\\n",
        "  --no-load-optim \\\n",
        "  --export-huggingface model_hf\n"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "2021-08-13 15:02:22.859973: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0\n",
            "WARNING: No training data specified\n",
            "using world size: 1 and model-parallel size: 1 \n",
            " > using dynamic loss scaling\n",
            "> initializing model parallel with size 1\n",
            "prepare tokenizer done, size 50264\n",
            "building GPT3 model ...\n",
            " > number of parameters on model parallel rank 0: 125231616\n",
            "Load checkpoint from model/\n",
            "global rank 0 is loading checkpoint model/iter_0002000/mp_rank_00/model_optim_rng.pt\n",
            "  successfully loaded model/iter_0002000/mp_rank_00/model_optim_rng.pt\n",
            "Loaded\n",
            "Export to huggingface model  model_hf with config {'vocab_size': 50264, 'n_positions': 2048, 'n_ctx': 2048, 'n_embd': 768, 'n_layer': 12, 'n_head': 12}\n",
            "Saved huggingface model <class 'src.model.distributed.DistributedDataParallel'>\n",
            "Exported in huggingface format to model_hf\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ZIfhv7KvDKQi",
        "outputId": "e61937bd-fb49-4b9c-c1ce-d25083b8db76"
      },
      "source": [
        "!ls model_hf"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "config.json  pytorch_model.bin\t      tokenizer_config.json\n",
            "merges.txt   special_tokens_map.json  vocab.json\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KRlEwlPdE0L8"
      },
      "source": [
        "#### Test load"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "5U81i24aEEm0"
      },
      "source": [
        "from transformers import GPT2LMHeadModel"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "eBRatZnJEcCX"
      },
      "source": [
        "model = GPT2LMHeadModel.from_pretrained(\"model_hf\")"
      ],
      "execution_count": null,
      "outputs": []
    }
  ]
}